Exivity Kubernetes best practices

This document describes the recommended Kubernetes deployment patterns for Exivity in self-managed, on-premises environments. It is intended as a prescriptive starting point for implementation teams that need a default architecture for single-node, multi-node, and multi-site deployments.

The recommendations below assume a Linux Kubernetes cluster, Helm-based deployment using the Exivity chart, and self-managed infrastructure components such as ingress, storage, PostgreSQL, and RabbitMQ.

Third-party middleware

Running Exivity on Kubernetes depends on third-party infrastructure and middleware: Kubernetes itself, PostgreSQL, RabbitMQ, an ingress controller such as NGINX Ingress Controller or Traefik, and a storage platform such as Longhorn or NFS-backed storage.

These products are third-party infrastructure that you operate. Exivity documents how the application depends on them, but you are responsible for selecting, operating, securing, backing up, monitoring, and supporting those third-party products according to their vendor documentation and your internal platform standards.

Deployment scenarios

| Scenario | Recommended use | Preferred architecture |
| --- | --- | --- |
| ☸️ Single-node Kubernetes | Small environments, evaluation, non-HA production where simplicity is preferred | One Kubernetes node, ingress/TLS, embedded or external PostgreSQL, embedded RabbitMQ, provisioner-backed local RWO storage (or RWX if already available) |
| ☸️ Multi-node Kubernetes | Production HA deployments within one site | Multi-node Kubernetes, Longhorn RWX storage, external PostgreSQL, site-local in-cluster RabbitMQ, ingress/load balancer |
| ☸️ Multi-site Kubernetes | Disaster recovery across sites | Active/passive sites with replicated PostgreSQL, independent RabbitMQ per site, independent storage per site, GitOps-controlled failover |

Common foundations

Use these foundations for all Kubernetes scenarios.

| Area | Recommendation |
| --- | --- |
| ☸️ Kubernetes | Use a CNCF-conformant Kubernetes distribution on Linux nodes with a stable CSI driver and production ingress controller. Known Exivity deployments run on upstream Kubernetes, Rancher (RKE2/K3s), and Red Hat OpenShift. Other CNCF-conformant distributions are likely to work but should be confirmed with Exivity support before production use. Lightweight learning distributions such as Minikube, Kind, and Docker Desktop are intended for development only. |
| ⎈ Helm | Use Helm 3 and maintain deployment values in version control. |
| 🗂️ Namespace | Deploy Exivity into a dedicated namespace, normally exivity. |
| 🚦 Ingress / load balancer | Use a production ingress controller such as NGINX or Traefik. Terminate TLS at ingress or at an upstream load balancer. |
| 🛡️ TLS | Use cert-manager, enterprise PKI, or an existing TLS secret. Do not expose production Exivity over plain HTTP. |
| 🔐 Secrets | Set secret.appKey and secret.jwtSecret explicitly for production. Do not rely on generated values. |
| 📦 Image registry | Mirror images to an internal registry for restricted or air-gapped sites. |
| 🔄 Backups | Back up PostgreSQL and Exivity shared data. Do not rely only on persistent volumes for recovery. |
| 📈 Monitoring | Enable Kubernetes, ingress, PostgreSQL, RabbitMQ, and storage monitoring. Enable the Exivity ServiceMonitor where Prometheus Operator is used. |
| 📄 Logs | Tune log retention with logfiles.deleteDays and logfiles.compressDays to match your retention and storage requirements. |
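
As a minimal sketch, the secret and log-retention settings named in the table might appear in a version-controlled values file as shown below. Only the key paths (secret.appKey, secret.jwtSecret, logfiles.deleteDays, logfiles.compressDays) come from this document. The file name, placeholder strings, and retention numbers are illustrative assumptions and should be checked against the chart's own values.yaml.

```yaml
# values-production.yaml (hypothetical file name)
secret:
  # Set both explicitly for production. Placeholders only: inject the real
  # values from your secret management tooling rather than committing them.
  appKey: "<replace-with-your-app-key>"
  jwtSecret: "<replace-with-your-jwt-secret>"

logfiles:
  # Illustrative numbers, not chart defaults: compress after one week,
  # delete after thirty days.
  compressDays: 7
  deleteDays: 30
```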

PVC sizing

The chart defaults are intentionally small and are usually not appropriate for production. Use the following as starting values and size upward for high-volume or multi-tenant environments.

| PVC group | Volume | Recommended size |
| --- | --- | --- |
| 📚 Data | extracted | 50-100Gi |
| 📚 Data | exported | 50-100Gi |
| 📚 Data | import | 10-20Gi |
| 📚 Data | report | 10-20Gi |
| 📄 Logs | All service log PVCs | 5-10Gi each |
| ⚙️ Config | etl, griffon, chronos | 1Gi |
| 🐘 PostgreSQL | Embedded PostgreSQL or CloudNativePG instance volume | 25-50Gi |

For CSI-backed storage such as Longhorn, PVC sizes are enforced by the storage backend. For NFS-subdir provisioners, PVC sizes may be advisory only, but they should still be set to document intent and simplify future migration to CSI-backed storage. For local-path style provisioners on single-node deployments, PVC sizes are advisory and not reserved against node disk capacity, so always validate the sum of requested PVC sizes against the actual node disk size and monitor node disk free space.

extracted and exported typically grow fastest because they depend on data source volume, retention, and report frequency. Prefer 100Gi for larger environments or high-frequency reporting.
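
To make the "sum of requested PVC sizes" check concrete, the starting values in the table can be added up as a rough worked example. This assumes the three config volumes are 1Gi each and uses n for the number of service log PVCs, which this document does not state; treat n as something to read from your own deployment.

```latex
\begin{aligned}
S_{\min} &= \underbrace{50 + 50 + 10 + 10}_{\text{data}}
          + \underbrace{3 \times 1}_{\text{config}}
          + \underbrace{25}_{\text{PostgreSQL}}
          + \underbrace{5n}_{\text{logs}}
          = 148 + 5n \ \text{Gi} \\
S_{\max} &= \underbrace{100 + 100 + 20 + 20}_{\text{data}}
          + \underbrace{3 \times 1}_{\text{config}}
          + \underbrace{50}_{\text{PostgreSQL}}
          + \underbrace{10n}_{\text{logs}}
          = 293 + 10n \ \text{Gi}
\end{aligned}
```

With an illustrative n = 10, the requested total falls between 198Gi and 393Gi, before any headroom for images or the operating system.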

Scenario A: single-node Kubernetes

Single-node Kubernetes is suitable when HA is not required or when you want the smallest possible Kubernetes footprint. It is also useful for demos, evaluation, and small production environments with clear recovery expectations.

Architecture

This table describes the role of each layer in the diagram above. It is descriptive, not prescriptive.

| Layer | Role in this scenario |
| --- | --- |
| 👥 Users / API clients | Reach Exivity through DNS and the cluster ingress endpoint. |
| 🛡️ Ingress / TLS | Routes / to glass and API paths to proximity-api; terminates TLS. |
| ☸️ Kubernetes node | Single node hosting the control-plane, worker role, all Exivity services, RabbitMQ, PostgreSQL, and shared storage. |
| 🐘 PostgreSQL | Either embedded (in-cluster) or external; both stay in the same single-node footprint. |
| 🐇 RabbitMQ | Embedded in-cluster RabbitMQ used for transient communication. |
| 💾 Shared volumes | Hold logs, config, and pipeline data; mounted into the Exivity services running on this node. |

Configuration

This table lists the choices to make for a single-node deployment. It is prescriptive.

| Decision | Recommended value |
| --- | --- |
| ☸️ Kubernetes | One Linux node running both control-plane and worker roles. |
| 🐘 PostgreSQL | External PostgreSQL is preferred. Embedded PostgreSQL is acceptable for evaluation and small environments. |
| 🐇 RabbitMQ | Use site-local in-cluster RabbitMQ. The embedded chart dependency is acceptable for evaluation. For production, prefer the RabbitMQ Cluster Operator running a single RabbitMQ node, because the embedded chart relies on the unsupported bitnamilegacy image. See the RabbitMQ section for details. |
| 💾 Storage access mode | RWX is not required. Set storage.sharedVolumeAccessMode: ReadWriteOnce because every Exivity pod runs on the same node. |
| 💾 Storage class | Use a provisioner-backed local StorageClass. Validated examples include Docker Desktop's hostpath, K3s' built-in local-path, and local-path-provisioner. Do not point Exivity directly at unmanaged raw hostPath volumes; always go through a StorageClass/provisioner. NAS/NFS is a valid alternative when you already operate reliable NAS, want to decouple storage from the Kubernetes node, want easier node rebuild/replacement, or anticipate migrating to multi-node later. NAS/NFS does not make Exivity HA when Kubernetes itself is still single-node. Longhorn works on a single node but provides limited HA value there because replicas cannot be spread across nodes. |
| 🚦 Ingress / load balancer | Any CNCF-conformant Kubernetes ingress controller with TLS termination is supported. Proven options include Traefik, NGINX Ingress Controller, and HAProxy Ingress. Reach the cluster through a LoadBalancer service provided by your platform (cloud provider's native load balancer or an upstream hardware load balancer). On bare-metal Kubernetes without a cloud provider, an implementation such as MetalLB can fill that role; treat it as one option among hardware and software load balancers and confirm operational fit with your platform team. |
| 🔄 Backups | Back up PostgreSQL and shared data. Test the restore path before production handover. |

Single-node disk capacity

Local-path style provisioners do not track or reserve aggregate disk capacity across PVCs on the node. The sum of requested PVC sizes can exceed available disk without Kubernetes blocking it. Size all Exivity PVCs against the actual node disk capacity, leave headroom for PostgreSQL, logs, and image growth, and monitor node disk free space.

Reference values: charts/exivity/examples/best-practice-single-node.yaml
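
The storage choices above can be sketched as two separate YAML documents. The first is a Helm values fragment using the storage.sharedVolumeAccessMode key named in this document. The second shows what the local-path StorageClass shipped with K3s or local-path-provisioner typically looks like, for orientation only, since it normally already exists in the cluster. The chart key that selects the storage class is not shown because this document does not name it; take it from the reference values file.

```yaml
# Document 1: Helm values fragment.
# All Exivity pods share one node, so ReadWriteOnce is sufficient.
storage:
  sharedVolumeAccessMode: ReadWriteOnce
---
# Document 2: typical local-path StorageClass (usually pre-installed).
# Sizes requested against this class are not reserved on the node disk.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```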

Scenario B: multi-node Kubernetes, single site

Multi-node Kubernetes is the preferred architecture for production HA environments within one site. This is the default recommendation for larger deployments.

Architecture

This table describes the role of each layer in the diagram above. It is descriptive, not prescriptive.

| Layer | Role in this scenario |
| --- | --- |
| 👥 Users / API clients | Reach Exivity through an external load balancer in front of cluster ingress. |
| 🛡️ Ingress / TLS | Routes traffic to multiple stateless Exivity replicas; terminates TLS. |
| ☸️ Kubernetes worker nodes | Multiple nodes spread across failure domains; host the Exivity application tier and middleware. |
| 🧩 Application tier | Stateless services (frontend, API, backend) run with multiple replicas; workflow and ETL services run as singletons. |
| 🐘 PostgreSQL | External or in-cluster Kubernetes-native PostgreSQL serving the active workload from the same low-latency site. |
| 🐇 RabbitMQ | Site-local in-cluster RabbitMQ used for transient communication. |
| 💾 Shared storage | RWX-capable storage shared across nodes; holds logs, config, and pipeline data. |

Configuration

This table lists the choices to make for a multi-node single-site deployment. It is prescriptive.

| Decision | Recommended value |
| --- | --- |
| ☸️ Kubernetes | Use at least three worker nodes. For HA control-plane requirements, also use three control-plane nodes. |
| 📍 Node placement | Spread nodes across racks, chassis, failure domains, or availability zones where available. |
| 🐘 PostgreSQL | Use external PostgreSQL for production. For self-hosted Kubernetes PostgreSQL, use CloudNativePG. |
| 🐇 RabbitMQ | Run RabbitMQ site-local in-cluster. Use the RabbitMQ Cluster Operator for production because the embedded chart dependency relies on the unsupported bitnamilegacy image. External or managed RabbitMQ is optional when required by your platform standards. See the RabbitMQ section for details. |
| 💾 Storage access mode | RWX is required because Exivity pods run across multiple nodes. Keep storage.sharedVolumeAccessMode: ReadWriteMany. |
| 💾 Storage class | Prefer Longhorn with three replicas per volume. An HA NAS/NFS platform that exposes RWX is a valid alternative when Longhorn or an equivalent CSI RWX storage class is not available. Avoid using a simple in-cluster NFS server (for example, the NFS Ganesha server and external provisioner backed by a single PVC) as the HA default unless its backing storage and node placement are explicitly designed for HA. |
| 🚦 Load balancer | Use a hardware load balancer or your existing load balancing platform in front of ingress. On bare-metal Kubernetes without a cloud-provided load balancer, an implementation such as MetalLB (L2 or BGP mode) can fill that role; confirm operational fit with your platform team before treating it as production-default. |
| 👥 Application replicas | Scale stateless frontend/API/backend services to at least two replicas. Keep workflow and ETL-style services singleton unless Exivity confirms a scaling pattern for your workload. |
| 📆 Scheduling | Use node anti-affinity or topology spread constraints where the platform supports it. |
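
For the scheduling row, a topology spread constraint in standard Kubernetes pod-spec form could look like the sketch below. The pod label is a hypothetical example, and where the Exivity chart accepts such a fragment is not covered by this document, so treat this as an illustration of the Kubernetes mechanism rather than of chart configuration.

```yaml
# Fragment of a pod template (spec.template.spec of a Deployment).
topologySpreadConstraints:
  # Prefer spreading replicas across nodes.
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: glass   # hypothetical label
  # Prefer spreading replicas across zones where nodes carry a zone label.
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: glass   # hypothetical label
```

ScheduleAnyway keeps pods schedulable when a node or zone is down; DoNotSchedule is the stricter alternative if you would rather leave a replica pending than co-locate it.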

Service replica guidance

The following is a conservative starting point. Scale after observing CPU, memory, queue depth, and report preparation behavior.

| Service | Starting replicas | Notes |
| --- | --- | --- |
| glass | 2 | Stateless UI. |
| proximityApi | 2 | Stateless API; scale horizontally behind ingress. |
| edify, horizon, pigeon, transcript, use | 2 | Pull work from RabbitMQ queues (REPORT, BUDGET, PIGEON/WORKFLOW_EVENT/REPORT_PUBLISHED, TRANSFORM, and EXTRACT respectively). RabbitMQ delivers each queued job to one consumer, so multiple replicas distribute load and increase throughput. |
| chronos, executor, griffon, proximityCli | 1 | Must remain singletons. These services own scheduling, workflow dispatch, and CLI execution, where multiple replicas would duplicate work. |

Avoid running the same job twice concurrently

RabbitMQ delivers each queued message to a single consumer, but it does not stop you from queueing the same logical task (the same extractor, transformer, or report for the same period) twice. Running the same task concurrently can produce overlapping writes to extracted, exported, or report, regardless of how many replicas a service has. This is a workflow-design concern, not a replica-count concern: design schedules and triggers so that the same task for the same period is never enqueued in parallel.

Reference values: charts/exivity/examples/best-practice-multi-node.yaml

Scenario C: multi-site active/passive

For deployments spanning multiple physical sites, Exivity recommends active/passive. The active site runs the application and middleware. The passive site continuously receives replicated data and is promoted during a failover event.

Active/passive avoids the operational complexity of active/active PostgreSQL writes, RabbitMQ stretching, Longhorn stretching, and workflow execution conflicts.

Architecture

This table describes the role of each layer in the diagram above, comparing the active and passive sites side by side. It is descriptive, not prescriptive.

| Layer | Active site | Passive site |
| --- | --- | --- |
| 🚦 Traffic routing | DNS, GSLB, or load balancer sends users to the active ingress. | Standby ingress is prepared for failover but does not receive normal traffic. |
| 🧩 Application tier | Exivity service replicas are greater than 0. | Exivity service replicas remain 0 until failover. |
| 🐘 PostgreSQL | Runs the primary database endpoint. | Receives replicated data or restores from validated backups before promotion. |
| 🐇 RabbitMQ | Runs an independent site-local RabbitMQ instance. | Runs a separate site-local RabbitMQ instance; RabbitMQ state is not replicated. |
| 💾 Storage | Uses site-local shared storage and backup replication. | Uses independent site-local storage restored or attached during failover. |
| ☸️ GitOps | Controls scaling, routing, and failover changes through versioned state. | Promotes the site through the same repeatable workflow. |

Configuration

This table lists the choices to make for a multi-site active/passive deployment. It is prescriptive.

| Decision | Recommended value |
| --- | --- |
| 🧩 Application | Run Exivity only in the active site. Keep passive-site application replicas at 0 until failover. |
| 🐘 PostgreSQL | Use active/passive PostgreSQL replication. For CloudNativePG, use a replica cluster or a supported backup/restore promotion pattern. |
| 🐇 RabbitMQ | Do not stretch RabbitMQ across sites. Deploy one site-local RabbitMQ instance per site. RabbitMQ state is not replicated between sites. |
| 💾 Longhorn / storage | Do not stretch a Longhorn cluster across sites. Use independent Longhorn clusters per site and replicate data through backups or storage-layer replication supported by your platform. |
| 🚦 DNS / load balancing | Use DNS, GSLB, or your load balancing platform to route users to the active site. |
| ☸️ Failover control | Use GitOps for repeatable failover. Argo CD with Argo Workflows or Argo Events is the preferred implementation pattern. |

Required GitOps failover pattern

A multi-site deployment must have a version-controlled, tested failover workflow. The workflow should perform the following actions in order:

  1. Mark Site A unavailable and stop routing new traffic to it.
  2. Scale Site A Exivity application replicas to 0 if the cluster is reachable.
  3. Promote the Site B PostgreSQL replica or restore the latest validated backup, depending on the PostgreSQL design.
  4. Ensure Site B RabbitMQ is available and configured for Exivity.
  5. Restore or attach the required Site B shared data volumes.
  6. Scale Site B Exivity application replicas above 0.
  7. Switch DNS, GSLB, or load balancer traffic to Site B.
  8. Run application validation checks before handing the service back to users.

If you do not have GitOps practices, implement this as a documented runbook, but understand that this is not the preferred operating model. For best-practice multi-site deployments, GitOps is required to reduce failover risk and make the process repeatable.
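
The eight steps above could be captured as an Argo Workflows skeleton along the following lines. This is a structural sketch only: every step calls a placeholder template that just echoes the intended action, and the workflow name, template names, and container image are hypothetical. Real implementations would replace the placeholder with kubectl, DNS, and database tooling appropriate to your platform.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: exivity-failover-a-to-b-   # hypothetical name
spec:
  entrypoint: failover
  templates:
    - name: failover
      steps:                     # each "- -" entry runs after the previous one
        - - name: stop-routing-to-site-a
            template: todo
            arguments: {parameters: [{name: action, value: "mark Site A unavailable"}]}
        - - name: scale-down-site-a
            template: todo
            continueOn: {failed: true}   # Site A may be unreachable
            arguments: {parameters: [{name: action, value: "scale Site A replicas to 0"}]}
        - - name: promote-postgresql
            template: todo
            arguments: {parameters: [{name: action, value: "promote Site B PostgreSQL or restore backup"}]}
        - - name: check-rabbitmq
            template: todo
            arguments: {parameters: [{name: action, value: "verify Site B RabbitMQ"}]}
        - - name: attach-storage
            template: todo
            arguments: {parameters: [{name: action, value: "restore or attach Site B shared volumes"}]}
        - - name: scale-up-site-b
            template: todo
            arguments: {parameters: [{name: action, value: "scale Site B replicas above 0"}]}
        - - name: switch-traffic
            template: todo
            arguments: {parameters: [{name: action, value: "route DNS, GSLB, or load balancer to Site B"}]}
        - - name: validate
            template: todo
            arguments: {parameters: [{name: action, value: "run application validation checks"}]}

    # Placeholder step: prints the action instead of performing it.
    - name: todo
      inputs:
        parameters:
          - name: action
      container:
        image: alpine:3.20       # hypothetical image choice
        command: [sh, -c]
        args: ["echo 'TODO: {{inputs.parameters.action}}'"]
```

Keeping the workflow in the same repository as the site values means a failover rehearsal exercises exactly the definition that would run in a real event.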

Reference values: charts/exivity/examples/best-practice-multi-site-active-passive.yaml

Active/active across sites

Active/active across sites is discouraged and should not be used as the default architecture.

The main concerns are:

| Concern | Impact |
| --- | --- |
| 🐘 PostgreSQL write conflicts | Bidirectional PostgreSQL replication is complex and can introduce conflict handling requirements that Exivity does not need in active/passive mode. |
| 📆 Workflow scheduling | Only one site should execute workflows unless there is a clear leader-election or workload partitioning design. Otherwise, work may be duplicated or events may not progress as expected. |
| 🐇 RabbitMQ stretching | RabbitMQ clusters should not be stretched across high-latency links for this use case. |
| 💾 Storage stretching | Longhorn should not be stretched across sites. Site-local storage is simpler and safer. |
| ⏱️ Latency | WAN latency to PostgreSQL can significantly affect report preparation and other database-heavy operations. |

If you need active/active, treat it as a custom architecture and involve Exivity engineering before committing to the design.

Middleware recommendations

The middleware products in this section are third-party dependencies. Exivity requires compatible database, message queue, storage, networking, backup, and monitoring services, but operating and supporting those services remains your responsibility or that of your chosen platform or vendor.

PostgreSQL

PostgreSQL is the most important stateful dependency. Production deployments should use external PostgreSQL rather than the embedded Bitnami dependency shipped with the chart.

Recommended options:

| Option | Recommendation |
| --- | --- |
| 🐘 Managed or standard PostgreSQL | Preferred where you already operate a supported HA PostgreSQL platform. |
| ☸️ CloudNativePG | Recommended for self-hosted PostgreSQL on Kubernetes. See the CloudNativePG documentation. |
| 🐘 Embedded Bitnami PostgreSQL | Acceptable for evaluation and small single-node deployments only. Not recommended for production HA. |

Starting recommendations:

| Setting | Recommendation |
| --- | --- |
| 💾 Storage | 25-50Gi minimum. Monitor and expand before reaching 70% utilization. |
| 🔁 Replication | Use active/passive HA within a site or across sites. |
| 🔄 Backups | Use PostgreSQL-native backups. For CloudNativePG, use Barman Cloud to S3-compatible object storage where available. |
| 🛡️ TLS | Use TLS for database traffic where supported by your platform. |
| 🌐 Latency | Keep Exivity and PostgreSQL in the same low-latency site for active workloads. WAN latency around 15ms or higher can materially affect report preparation. |
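
For the CloudNativePG option, the storage, replication, and backup rows could translate into a Cluster resource roughly like the one below. The resource name, storage class, bucket, endpoint, and secret names are hypothetical. The backup section uses the in-tree barmanObjectStore form; newer CloudNativePG releases may configure Barman Cloud through a plugin instead, so check the documentation for the version you run.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: exivity-pg                  # hypothetical name
  namespace: exivity
spec:
  instances: 3                      # one primary plus two standbys in the site
  storage:
    size: 50Gi                      # upper end of the starting range above
    storageClass: "<your-storage-class>"
  backup:
    retentionPolicy: "30d"          # illustrative retention
    barmanObjectStore:
      destinationPath: s3://exivity-pg-backups/     # hypothetical bucket
      endpointURL: https://s3.example.internal      # hypothetical endpoint
      s3Credentials:
        accessKeyId:
          name: pg-backup-creds     # hypothetical Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: pg-backup-creds
          key: ACCESS_SECRET_KEY
```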

RabbitMQ

Exivity uses RabbitMQ for transient application communication and work coordination, not as the primary system of record. Data integrity is primarily tied to PostgreSQL and shared data volumes. For this reason, a site-local in-cluster RabbitMQ deployment is the default recommendation. If RabbitMQ fails, Kubernetes can reschedule it, and interrupted work can be retried without introducing an external middleware dependency. External or managed RabbitMQ is optional when required by your platform standards.

Bitnami RabbitMQ chart end-of-life

The Exivity Helm chart's embedded RabbitMQ dependency is based on the Bitnami RabbitMQ Helm chart. Following the Bitnami container catalog changes on September 29, 2025, the chart consumes the unsupported bitnamilegacy/rabbitmq image through the Exivity-hosted Bitnami mirror. This is a temporary compatibility measure, and the image will not receive updates or security patches.

Treat the embedded RabbitMQ as suitable only for evaluation and small single-node deployments. For production, keep RabbitMQ site-local but run it outside the Exivity chart, preferably with the RabbitMQ Cluster Operator.

How to run site-local RabbitMQ:

| Implementation | When to use | Notes |
| --- | --- | --- |
| 🐇 RabbitMQ Cluster Operator | Production default for site-local RabbitMQ | Maintained upstream by the RabbitMQ team; uses official rabbitmq images; declarative RabbitmqCluster CRD; supported queue types and policies. |
| 🐇 RabbitMQ Messaging Topology Operator | Optional alongside the Cluster Operator | Lets you manage vhosts, users, queues, exchanges, bindings, and policies as Kubernetes resources. See the Messaging Topology Operator overview. |
| 🐇 Embedded chart dependency | Evaluation and small single-node only | Based on the Bitnami chart and bitnamilegacy image; do not treat as a long-term production architecture. |
| 🐇 Managed RabbitMQ | Optional, when required by platform standards | Not site-local; only choose this when you already operate a managed RabbitMQ platform. Connect Exivity through the external rabbitmq.host, rabbitmq.port, rabbitmq.vhost, and rabbitmq.secure values. |
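
As an illustration of the Cluster Operator option, a single-node, site-local RabbitMQ instance can be declared with a RabbitmqCluster resource such as the following. The resource name, storage class, and volume size are assumptions for the sketch, and the operator must already be installed in the cluster.

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: exivity-rabbitmq            # hypothetical name
  namespace: exivity
spec:
  replicas: 1                       # single node, so no RabbitMQ clustering
  persistence:
    storageClassName: "<your-storage-class>"
    storage: 10Gi                   # illustrative size
```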

Recommended options by scenario:

| Scenario | Recommendation |
| --- | --- |
| 🐇 Single-node | Use site-local in-cluster RabbitMQ. The embedded chart is acceptable for evaluation; prefer the RabbitMQ Cluster Operator (single node) for long-term deployments. |
| 🐇 Multi-node single-site | Use site-local in-cluster RabbitMQ via the RabbitMQ Cluster Operator. External or managed RabbitMQ is optional when required by your standards. |
| 🐇 Multi-site | Run one independent site-local RabbitMQ deployment per site. Do not stretch RabbitMQ across sites and do not replicate RabbitMQ state between sites. The RabbitMQ Cluster Operator is the preferred way to run each site-local deployment. |

Starting recommendations:

| Setting | Recommendation |
| --- | --- |
| 🐇 Clustering | Keep clustering disabled by default. Only enable clustering for a dedicated multi-node RabbitMQ design. |
| 🐇 Queues | Prefer quorum queues for new RabbitMQ designs where compatible with your RabbitMQ version and policy model. |
| 💾 Persistence | Use persistent storage for production RabbitMQ. |
| 📈 Monitoring | Monitor queue depth, memory, disk free space, and connection count. |

Confirm final RabbitMQ values against your chosen RabbitMQ deployment method before applying production tuning.
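
When RabbitMQ runs outside the Exivity chart, the connection values named earlier (rabbitmq.host, rabbitmq.port, rabbitmq.vhost, rabbitmq.secure) might be set as in this sketch. The host name assumes an operator-managed instance called exivity-rabbitmq in the exivity namespace, and the value types are assumptions. Credential keys and the switch that disables the embedded dependency are not named in this document, so they are omitted here and should be taken from the chart's values.yaml.

```yaml
rabbitmq:
  host: exivity-rabbitmq.exivity.svc   # hypothetical in-cluster service name
  port: 5672                           # AMQP; 5671 is the conventional TLS port
  vhost: /
  secure: false                        # set to true when connecting over TLS
```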

Longhorn

Longhorn is the preferred storage provider for HA Exivity Kubernetes deployments when it is available in your environment. It is considered mature enough for production use and is generally more resilient than a standard in-cluster NFS provisioner in HA environments.

Recommended options:

| Scenario | Recommendation |
| --- | --- |
| 💾 Single-node | RWX is not required. Use a provisioner-backed local StorageClass (Docker Desktop hostpath, K3s local-path, or local-path-provisioner) with storage.sharedVolumeAccessMode: ReadWriteOnce. NAS/NFS is a valid alternative when you already operate reliable NAS or want storage decoupled from the node, but does not by itself make Exivity HA when Kubernetes is single-node. Longhorn is possible but provides limited HA value on one node because replicas cannot be spread across nodes. |
| 💾 Multi-node single-site | Prefer Longhorn with three replicas per volume. |
| 💾 Multi-site | Use one independent Longhorn deployment per site. Do not stretch Longhorn across sites. |

Starting recommendations:

| Setting | Recommendation |
| --- | --- |
| 💾 Replicas | Configure three replicas per volume for HA environments. |
| ☸️ Replica placement | Spread replicas across nodes and failure domains where possible. |
| 🔄 Backups | Configure recurring snapshots and recurring backups to an S3-compatible or otherwise approved backup target. |
| 📊 Capacity | Size disks for usable capacity after three-way replication and snapshot overhead. |
| 💾 RWX | Validate RWX behavior before production, including share-manager scheduling and failover. |

Confirm final Longhorn StorageClass and recurring job values against the current chart before applying production tuning.
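
A sketch of the replica and backup recommendations, expressed as a Longhorn StorageClass and a RecurringJob, is shown below. The resource names, schedule, and retention count are illustrative assumptions, and the recurring backup only works once a backup target has been configured in Longhorn.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-exivity            # hypothetical name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"             # three-way replication for HA
  staleReplicaTimeout: "30"         # minutes
---
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: exivity-daily-backup        # hypothetical name
  namespace: longhorn-system
spec:
  task: backup
  cron: "0 2 * * *"                 # illustrative: daily at 02:00
  retain: 7                         # illustrative: keep seven backups
  concurrency: 1
  groups:
    - default                       # applies to volumes in the default group
```

With three replicas, each Gi requested consumes roughly three Gi of raw disk across the cluster before snapshot overhead, which is the basis for the capacity row above.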

NFS

When NFS is used as RWX storage for Exivity, deploy the NFS Ganesha server and external provisioner (the nfs-server-provisioner Helm chart) rather than an unspecified NFS server. It serves NFSv4 with file locking, which Exivity requires, and is the reference NFS provisioner used in this documentation.

This in-cluster provisioner can work well for smaller or simpler deployments. For HA environments, prefer an external HA NAS platform that exposes NFSv4, because the in-cluster provisioner becomes a single point of failure unless its backing storage and node placement are explicitly designed for HA.

Operations checklist

Before production

| Check | Requirement |
| --- | --- |
| 💾 Storage | RWX storage class validated with Exivity PVCs. |
| 📶 PVC sizes | Production PVC sizes set explicitly. |
| 🐘 PostgreSQL | HA design, backups, restore, and monitoring validated. |
| 🐇 RabbitMQ | Connectivity, authentication, TLS, and monitoring validated. |
| 🚦 Ingress / load balancer | DNS, TLS certificate, ingress class, and trusted proxy behavior validated. |
| 🔐 Secrets | Production secret.appKey, secret.jwtSecret, PostgreSQL password, and RabbitMQ password configured. |
| 🔄 Backups | Restore test completed. |
| 📈 Monitoring | Cluster, application, PostgreSQL, RabbitMQ, ingress, and storage alerts configured. |
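
One simple way to begin the storage check is a throwaway PVC that requests ReadWriteMany from the candidate storage class, as sketched below. The claim name and storage class placeholder are hypothetical. A bound claim only shows that provisioning works; mounting it from pods on two different nodes and writing from both is still needed to confirm real RWX behavior.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-smoke-test              # hypothetical name; delete after testing
  namespace: exivity
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: "<your-rwx-storage-class>"
  resources:
    requests:
      storage: 1Gi
```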

Day-2 operations

| Area | Recommendation |
| --- | --- |
| ⎈ Upgrades | Run Helm upgrades from version-controlled values. Back up PostgreSQL before upgrades. |
| 📊 Capacity | Monitor PostgreSQL, extracted, exported, and log volume growth. |
| 📄 Logs | Lower retention before expanding log PVCs unnecessarily. |
| 🔁 DR | Test failover regularly for multi-site deployments. |
| 🔐 Security | Rotate credentials according to your security policy and keep images patched. |

Example values files

Use these example files as starting points, not as final production values:

| Scenario | File |
| --- | --- |
| Single-node | charts/exivity/examples/best-practice-single-node.yaml |
| Multi-node | charts/exivity/examples/best-practice-multi-node.yaml |
| Multi-site active/passive | charts/exivity/examples/best-practice-multi-site-active-passive.yaml |